Adsorbate chemical environment-based machine learning framework for heterogeneous catalysis

Heterogeneous catalytic reactions are influenced by a subtle interplay of atomic-scale factors, ranging from the catalysts’ local morphology to the presence of high adsorbate coverages. Describing such phenomena via computational models requires generation and analysis of a large space of atomic configurations. To address this challenge, we present Adsorbate Chemical Environment-based Graph Convolution Neural Network (ACE-GCN), a screening workflow that accounts for atomistic configurations comprising diverse adsorbates, binding locations, coordination environments, and substrate morphologies. Using this workflow, we develop catalyst surface models for two illustrative systems: (i) NO adsorbed on a Pt3Sn(111) alloy surface, of interest for nitrate electroreduction processes, where high adsorbate coverages combined with low symmetry of the alloy substrate produce a large configurational space, and (ii) OH* adsorbed on a stepped Pt(221) facet, of relevance to the Oxygen Reduction Reaction, where configurational complexity results from the presence of irregular crystal surfaces, high adsorbate coverages, and directionally-dependent adsorbate-adsorbate interactions. In both cases, the ACE-GCN model, trained on a fraction (~10%) of the total DFT-relaxed configurations, successfully describes trends in the relative stabilities of unrelaxed atomic configurations sampled from a large configurational space. This approach is expected to accelerate development of rigorous descriptions of catalyst surfaces under in-situ conditions.

: Atomic and bond attributes used to generate the crystal subgraph network motifs. a Log scale is used to encode this property. b One-hot encoding (this work) or Gaussian basis (elsewhere 1 ) can be used.

Model architecture
The forward pass for the ACE-GCN, once subgraphs are generated for high coverage adsorbate configurations, is done by applying graph convolutions on the individual subgraphs, as implemented in previous accounts, 1,2 where node features ( ) of all nodes ∈ in a graph = ( , ) are iteratively updated by aggregating localized information from their neighbors ( ).
Training a GNN for global graph regression is performed as follows: 1. Embed each node by performing multiple rounds of graph convolutions 2. Aggregate node embeddings into a single graph embedding (readout layer) 3. Train a fully-connected neural network on the graph embedding with output as the target property of choice

a. Pooling operation through supplemental indexing
Hierarchical pooling operations are employed in the ACE-GCN model to allow encoding of arbitrary sized subgraphs. As such, the model can account for a variable number of neighbors, coverages, and different types of binding sites on the catalyst surface. Every subgraph, node, and neighbor are labeled by a supplemental indexing scheme, as shown in the figure. Thus, when pooling the neighbors, the node index entry can be used to combine the features. A similar pooling scheme is extended to node properties and finally to the subgraphs themselves. The pooling operations are performed through PyTorch Geometric's Scatter method (https://pytorch-scatter.readthedocs.io/en/latest/). Through this method, elements in the input matrix of known dimensions can be reduced (via summation or mean operation) by explicitly specifying the indices which must be used for the said reduction. As a result, through successive hierarchical pooling, every high coverage configuration entry which is encoded as multiple subgraphs is condensed to a single fingerprint vector.  Table S3 lists various model hyperparameters considered when constructing the ACE-GCN model. Besides the GCN module, the type of graph object and node encoding being developed can also be varied. The spatial bond attribute can be either expressed as simple one-hot encoding (as done in our case) or as Gaussian-based feature expansion (done in previous accounts 1 ). We found that one-hot encoding worked best in our case since we seek to use the initial optimized configuration for predicting the stability of the converged high coverage adsorbates (see below for additional discussions). One hot encoding is insensitive to small fluctuations arising in the bond length from DFT relaxations (Refer Figure S3), whereas Gaussian-based featurization can be sensitive to changes in bond length, causing the resulting bond feature, during graph object generation, to vary significantly between relaxed and unrelaxed structures. These attributes are explained in detail in the next section. In summary, it is possible to modulate the model's sensitivity to the bond features using the two featurization techniques.

b. Model hyperparameters
Appropriate set of hyperparameters for NO*/Pt3Sn and OH*/Pt analysis were chosen through a grid search approach, wherein multiple set of ACE-GCN model parameters were considered and the set which provided lowest mean absolute error (MAE) for the validation set were chosen. As a starting point for the model optimization, hyperparameters used by Xie et. al. were selected. Overall we observed that the model performance was most affected by the learning rate while other hyperparameters had less pronounced effect on the MAE. During the pretraining (incremental) of higher coverage configurations the same model employed in lower coverage were used.  c. Spatial bond attribute encoding i.
Effect of type of spatial bond attribute encoding. (A) and (B) plot the encoding vector generated for one-hot encoding and the Gaussian-feature expansion for a bond length. Considering 1.2 and 1.4 Å as examples, the bond feature for one-hot encoding will be the same, resulting in the same encoding in both cases. However, for the Gaussian-based encoding, there is a distribution of non-zero values, corresponding to the likelihood of that bond length being from Gaussian distributions that are generated in the user-defined range. Hence the two bonds, varying slightly in length, will be encoded differently. This sensitivity can be modulated by tuning the parameters generating the encodings, like the number of intervals, the mean, and the standard deviation of the Gaussian kernel employed.
Nevertheless, there will always be some value associated with the bond features in the vicinity of the actual bond length value. This sensitivity, while useful for encoding the differences in the length, can make generalization of the model (trained on relaxed structures) to unrelaxed configurations difficult. As discussed in the next section, small deviations in the edge property may result in ACE-GCN network associating it with large energetic changes, since those are very different from the training set bond features. To mitigate such a response, one-hot encoding is chosen.
For the reasons discussed above, use of a Gaussian feature expansion could result in an ACE-GCN model which is highly sensitive to bond distances. This is not necessarily required when screening many different configurations where the major variations are expected to result from differences in binding environments (neighboring nodes) rather than spatial bond lengths. The one-hot encoding helps in modulating the model's sensitivity to such effects so that less emphasis is placed on the fluctuations in the bond distances, thus permitting systematic screening of high coverage configurations using initial optimized structures as guess atom positions. ii.
The effect of spatial bond encoding on the ACE-GCN model predictions. The boxplot ( Figure S4) shows the predicted energy difference for a given configuration between its unrelaxed state and post-DFT relaxation. The plot to the left is the generated by encoding bond distances as one-hot encodings, whereas the energy difference in the right was plotted for cases where bond distances were encoded as Gaussian window (refer Figure S3 for more details). All other model hyperparameters were kept the same during the network training. The energy difference (y-axis), as predicted by ACE-GCN for relaxed and unrelaxed cases, are plotted as a function of the net coordination number difference (x-axis). It is seen that the range, given by spines of the box plot, is larger for the predictions with Gaussian encodings compared to one-hot. The inter-quartile range is also larger for the Gaussian encodings, suggesting more variance. The table below each box plot lists the key statistics for the energy difference (y-axis of the box plot) estimated for each net coordination difference (x-axis of box plot). In particular, for cases where binding sites did not change post DFT-relaxation (given by coordination difference of 0), the mean and standard deviation for Gaussian window encoding is larger than that for one-hot encoding. Similar conclusions can be drawn for those belonging to other cases (refer to the mean and standard deviation column entries in Figure  S4). Ideally, to systematically sample low energy configurations, an appropriate initial guess for a strong binding adsorbate, such as NO* on Pt3Sn(111), should be sufficient for choosing candidates for subsequent, expensive DFT relaxation. To facilitate this type of analysis in the ACE-GCN formalism, we use an encoding which is more sensitive to the chemical geometric environment of each atom in the graph than to the intrinsic bond length of every atom pair (edge property of pairwise nodes in the subgraph). Doing so allows us to propose unrelaxed structures which often result in stable configurations after relaxation. Since the model is trained on converged configurations, such an approach ensures that the model energy predictions are not overly sensitive to the bond distances, which would make predictions using initial configurations more difficult. In general, while either approach could be used with ACE-GCN, we conclude, on balance, that the one-hot method is more useful for our intended purposes.

NO/Pt3Sn(111) Analysis
a. Screening high coverage configurations using the initial (unrelaxed) structures Figure S5 compares average NO* binding energy predictions from ACE-GCN (x-axis) with the DFT energies of the corresponding (fully relaxed) configurations plotted on the y-axis (see Methods for details of the calculations). All NO* configurations identified in the analysis are plotted in the diagram. Statistics for each coverage are listed in Table S4. A few of the configurations predicted to be unstable, as per initial (unrelaxed) guesses, relax to stable arrangements after DFT optimization. Such a relaxation is attributed to the shifting of NO binding location. The table below each plot describes the statistics for the given NO* coverage. The ACE-GCN model can be used to predict binding energies on the unoptimized geometries or the optimized geometries. The latter comparison is accurate and closely follows the DFT predictions, as the structures of both configurations are very similar. Looking at the error (∆BENO) column, the error for the DFT/optimized column has mean of 0.01 eV for 4/5/6-NO* coverage. Parity plots for comparing the ACE-GCN predicted energies and DFT-optimized energies are discussed in subsequent sections. See main text for descriptions of how the algorithm was retrained for each higher coverage level.  Interestingly, the median for energy difference increases, that is, structures relax considerably more, when the initial arrangement is unstable. Hence, the representations predicted to be unstable by ACE-GCN undergo most relaxation and show the highest energy change.
The results demonstrate that for all coverages considered, the initial configurations enumerated by SurfGraph span all the ACE-GCN energies associated with the DFTrelaxed configurations. This suggests that we do not lose information when we consider only the configurations which have not changed in terms of overall NO* coordination during DFT relaxation, as plotted in Figure 3 (main text). c. Representative stable/unstable configurations identified by ACE-GCN To showcase the non-linear mapping, from geometric fingerprints to target, being developed by ACE-GCN in the NO* energy predictions, selected configurations for 4/6-NO* coverages are shown in Figure S7 with corresponding energetics listed in Table S5. The site difference column in Table S5 denotes the change in net NO* coordination number before and after DFT relaxation (see discussion above).
In the high energy (unstable) case, as observed from in our previous analysis, 3 NO* molecules are primarily bound to the bridge sites, with few occupying threefold sites. For the low energy (stable) cases, top site occupancy is found to be very common. This is true for both the 4-NO* and 6-NO* cases and is predicted independently by ACE-GCN with no need for further DFT optimization. Finally, in cases where structures generated by SurfGraph undergo considerable reconstruction, the energies in the DFT-relaxed structures (both DFT and ACE-GCN) are, not surprisingly, quite different from the initial energies. In all cases, if given a DFT-relaxed configuration, ACE-GCN predicts average NO binding energies that are close to the DFT-relaxed energy.

Table S5. DFT-estimated and ACE-GCN-predicted (on relaxed and unrelaxed configurations) BENO.
For configurations with no reconstruction, the net NO* coordination difference columns is 0.0, and the model captures the energetics quite well. In all cases, ACE-GCN correctly predicts DFT energies if given the final, DFT-relaxed structures.  Table S5. A single periodic unit cell is depicted in each case.

d. Parity plots for training of ACE-GCN model
The ACE-GCN model performance when trained on the 123-NO* dataset is shown in Figure S8(a). The performance is improved when some portion of the 4-NO* points (the 100 most and least stable configurations, as ranked by ACE-GCN) are incorporated in the model training and validation ( Figure S8(b)).The 4-NO* dataset, comprising 200 points, is split into training and testing sets in an 80-20 ratio. The testing set (40 points) is kept aside for testing ACE-GCN model performance. The training set (160 points) is included in the model training and validation in (b) where the ACE-GCN model is trained / validated on the combined dataset of 1/2/3-NO* + 4NO* points.  Figure S5. This analysis is as per discussion in Figure 3, main text. (i) and (ii) highlight the region of low (and high) energy in the plots and the representative NO* configurations observed post-DFT relaxation. Figure S9 depicts ACE-GCN performance using an incremental training strategy, wherein a limited number of DFT-relaxed energies at given NO* coverages are systematically added to the training set and used to predict energies on the DFT-relaxed configurations for the subsequent coverage level. All of the 1/2/3 NO* configurations are included in the model training, but only 80% of total possible 4/5/6 NO* configurations are considered in the network training exercise. The remaining 20% are held out for testing purposes (holdout test set). The performance of the ACE-GCN model on the held-out test set of 5/6 NO* is plotted below each training set parity curve. Adding more DFT-relaxed energies to the training sets, of course, slightly increases the accuracy of the ACE-GCN predictions, but at the price of reduced computational efficiency. In this analysis, we consider all NO* configurations and test the ACE-GCN model prediction error as a function of different training data portions. All 1,2,3-NO* points are used in the training set. For 4/5/6-NO* data, the dataset is randomly split in a 80-20 ratio. This way, the 80% of the data (hold-out training) is gradually incorporated in the model training, and the remaining 20% of data (hold-out testing) is reserved for estimating the model performance (e.g. estimating the MAPE, Figure S10, using only the test data). As shown in Figure S10, considering the 4-NO* case, gradually introducing more data, out of the 80% reserved dataset, increases the ACE-GCN performance for predicting 4-NO points. The performance plateaus when around 50% of the hold-out training data is included in the model training. A similar analysis is conducted for 5/6-NO*.

OH/Pt(100) and Pt(221) Analysis
Screening high coverage OH* configurations on Pt(221) Figure S11 shows all configurations for 4/5/6 OH* that were selected for DFT optimization. Some of these configurations resulted in OH bonds being dissociated (marked by light blue points). Although these configurations have low DFT-relaxed binding energies, they will likely be less stable if water solvation effects are considered, and the ACE-GCN algorithm is not configured to rigorously analyze such structures. Hence, these configurations are not considered in the model training. For each OH* coverage, SurfGraph is used to create the initial configurations. The ACE-GCN model then predicts the average BE for those configurations. DFT relaxation is carried out on candidates selected from the most stable and unstable regions to rigorously test the accuracy of the ACE-GCN model. The number in the inset shows the ratio of number of configurations that had intact OH* to the number that contained dissociated OH* (out of 400 cases), with total possible enumerations mentioned in the bracket. The analysis presented here is equivalent to that presented in Figure 3 (main text), although here dissociated configurations are plotted, as well. Figure S12 explores the effect of using different datasets for training the ACE-GCN model for ranking 4-OH* on Pt(221) configurations. Energy post-DFT relaxation is used for validating the ranking. In all cases, the same set of 400 points of 4-OH* (as discussed in the previous section and the main text) are used for ACE-GCN energy predictions (shown on the x-axis), while the y-axis is the DFT-relaxed energy. A) Only Pt(100) OH points are used for training ACE-GCN model, which is then used to predict energies for the 400 4-OH* Pt(221) cases. Here, the ACE-GCN predicted initial energy does not correlate with the final energy. B) Only 1,2,3-OH* on Pt(221) configurations are used in the model training. There is improvement in the ranking of 4-OH* structures. C) The combined dataset of Pt(100) and Pt(221) points is used for training, and this ACE-GCN is used to rank the 4-OH* on Pt(221). This model shows further improvement in ranking the 4-OH* configurations, especially in the low energy (stable) region.  (100) and Pt(221) datasets, in the right-most parity plot, yields slightly improved parities. Figure S13 shows parity plots for the variation in the ACE-GCN model prediction performance as higher coverage (4/5 OH*) cases are incorporated in the model training.

Influence of different OH* training sets on 4-OH* prediction
In all cases, a random 80-10-10 split to generate training, validation, and testing is used for network training and prediction. Starting from Figure S14 (left), the parity plot shows the performance of ACE-GCN in estimating the average binding energy of DFT-relaxed 4 and 5 OH* configurations with the ACE-GCN model trained only on energies and geometries of DFT-relaxed Pt(100) 1-5OH and Pt(221) 1/2/3 OH* points. Next, following the incremental training strategy described in Figure 1 (main text), the ACE-GCN model is retrained on selected non-dissociated DFT-relaxed 4-OH* points (320 cases). The model performance is shown in the middle parity plot for predicting energetics on DFTrelaxed 4/5 OH* points. Next, this model was retrained using the selected non-dissociated 5-OH* (273) cases and incorporated into the model training. The prediction error for 4/5OH* data points, especially 5OH*, is seen to decrease, given by lowering MAPE and increasing R 2 from left to right, as the model is sequentially trained on higher coverage cases.